The World Health Organization (WHO) reports that stroke is the second most common cause of death worldwide, accounting for around 11% of all deaths.
Problem Statement
Brain stroke is a leading cause of disability and mortality worldwide, and early detection and treatment are critical to improving outcomes. However, accurately predicting the likelihood of a stroke can be challenging for medical professionals. Our goal is to develop an accurate and reliable model that can help healthcare providers identify patients at high risk of stroke and implement preventive measures.
Solution
Our solution involves using data analysis methods to build a predictive model for brain strokes. This model will learn from patient information, including things like age, medical history, lifestyle, and test results. By picking out the most important factors, we'll create a model that helps doctors find people with a higher chance of having a stroke. This way, doctors can step in early and improve the outcomes for patients.
Data Sources
The dataset for this project was created by Federico Soriano and obtained from Kaggle. It is used to predict whether a patient is likely to have a stroke based on the given attributes.
Data Exploration
An initial look at the dataset suggests that age is a key risk factor: the risk of stroke increases with age, and the increase is more pronounced after age 50.
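One simple way to check an age effect like this is to group patients into age bands and compare the share of stroke cases in each band. The snippet below is only a minimal sketch of that idea: the records are made-up toy values, not rows from the Kaggle dataset, and the function name and band width are our own assumptions.

```python
from collections import defaultdict

# Hypothetical toy records as (age, stroke flag); NOT taken from the real dataset.
records = [
    (23, 0), (31, 0), (38, 0), (44, 0), (47, 1), (52, 0),
    (58, 1), (63, 0), (67, 1), (74, 1), (79, 1), (81, 0),
]

def stroke_rate_by_age_band(rows, band_width=10):
    """Return {band_start: fraction of rows in that band with stroke == 1}."""
    totals = defaultdict(int)
    strokes = defaultdict(int)
    for age, stroke in rows:
        band = (age // band_width) * band_width  # e.g. 47 -> 40
        totals[band] += 1
        strokes[band] += stroke
    return {band: strokes[band] / totals[band] for band in sorted(totals)}

rates = stroke_rate_by_age_band(records)
for band, rate in rates.items():
    print(f"ages {band}-{band + 9}: stroke rate {rate:.2f}")
```

On real data, the same grouping would show whether the stroke rate actually climbs faster after age 50.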
Techniques Used
This is an example of a complex, non-linear classification problem that still requires explainability. We decided to use four techniques to find out which model offers at least moderate accuracy together with high explainability. The cleaned dataset was used for both training and validation, split 60:40 so that enough data remained for training. Stroke (yes/no) is the target variable, while the remaining attributes (age, glucose level, hypertension, etc.) are the independent variables. We used Azure ML Studio and HeuristicLab (for the genetic algorithm, GA) to create the models.
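To make the 60:40 split concrete, here is a minimal sketch in plain Python. The models themselves were built in Azure ML Studio and HeuristicLab, so this is not the code used in the project. The rows and column order are hypothetical, and stratifying by the target (keeping the same share of stroke cases in both parts) is our own assumption, not something the original workflow is known to have done.

```python
import random

def stratified_split(rows, label_index, train_fraction=0.6, seed=0):
    """Split rows into train/validation lists, keeping the class ratio in each."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    train, valid = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)  # shuffle within each class before cutting
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

# Hypothetical rows: (age, glucose level, hypertension, stroke); toy values only.
rows = [(40 + i, 90.0 + i, i % 2, 1 if i % 5 == 0 else 0) for i in range(50)]
train, valid = stratified_split(rows, label_index=3)
print(len(train), "training rows,", len(valid), "validation rows")
```

A plain random split would also give 60:40 overall, but with a rare target such as stroke it can leave very few positive cases in one of the two parts.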
Logistic Regression
Logistic regression is a widely used statistical method for predicting binary outcomes because of its simplicity, flexibility, and robustness. We used logistic regression for this problem because it can perform well, especially when the predictors are relatively independent of each other.
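As an illustration of what logistic regression does, the sketch below fits a one-feature model, p(stroke | age) = sigmoid(w * x + b), by batch gradient descent on the log loss. It is a minimal sketch under stated assumptions: the ages and labels are made-up toy values, the learning rate and epoch count are arbitrary, and the project itself used Azure ML Studio rather than hand-written code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical toy data; NOT taken from the real dataset.
ages = [25, 32, 41, 48, 53, 59, 64, 70, 76, 82]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

# Standardize age so gradient descent behaves well.
mean = sum(ages) / len(ages)
std = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages))
w, b = fit_logistic([(a - mean) / std for a in ages], labels)

def stroke_probability(age):
    """Predicted probability of stroke for a given age under the toy model."""
    return sigmoid(w * (age - mean) / std + b)

print(f"w = {w:.2f}, p(30) = {stroke_probability(30):.2f}, p(80) = {stroke_probability(80):.2f}")
```

The sign and size of w can be read directly: a positive w means the predicted odds of stroke rise with age, which is part of why this model is easy to explain.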
Decision Tree
Decision trees are popular machine learning techniques often used for classification problems. We used the decision tree for this problem, as it provides a clear and intuitive representation of the decision-making process, thus making it easy to understand and interpret the output. While it can be less effective for a problem where the relationships between different risk factors can be complex and non-linear, we selected this technique because it has high explainability.
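The explainability of a decision tree comes from the fact that each node is a simple threshold rule. The sketch below shows how a single node could be chosen: it tries every midpoint between neighbouring ages and keeps the one with the lowest weighted Gini impurity. This is an illustrative sketch only. The data is made up, Gini is just one common split criterion, and we do not know which criterion the Azure ML Studio model used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best 'value <= threshold' split."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold halfway between neighbours
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data; NOT taken from the real dataset.
ages = [25, 32, 41, 48, 53, 59, 64, 70, 76, 82]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

threshold, impurity = best_threshold(ages, labels)
print(f"if age <= {threshold}: predict no stroke, else: predict stroke")
```

A full tree repeats this search on each resulting subset and across all attributes, and the final model can be read as a chain of such if/else rules.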
Neural Network
We used neural networks because they can capture complex non-linear relationships between the predictor variables and the target variable. Since this data is sufficiently complex and one of the objectives is to achieve high predictive accuracy, a neural network may be a suitable choice. However, neural networks can be harder to interpret and explain, which could be a disadvantage.
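To show what "non-linear" means here, the sketch below runs a tiny network with one hidden layer of two ReLU units. The weights are hand-picked rather than trained, and the two binary inputs are generic placeholders, not attributes from the dataset. The network outputs 1 only when exactly one input is set (the XOR pattern), which a single linear model such as logistic regression cannot represent.

```python
def relu(z):
    return max(0.0, z)

def tiny_network(x1, x2):
    """One hidden layer with two ReLU units; weights are hand-picked, not trained."""
    h1 = relu(x1 + x2)        # active when at least one input is set
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are set
    return h1 - 2.0 * h2      # output layer combines the two hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, tiny_network(x1, x2))
```

The same hidden units that make this possible are also what make the model harder to explain: the output no longer maps to one readable coefficient per input.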
Genetic Algorithms
We used a genetic algorithm (GA) on the assumption that it can use natural selection to find a suitable way to model this dataset. However, GAs are predominantly used for optimization problems and can be difficult to interpret and explain, which is a disadvantage in contexts such as medical diagnosis, so this approach could be less effective.
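For readers unfamiliar with GAs, the sketch below shows the basic loop of selection, crossover, and mutation on a deliberately small problem: evolving two thresholds for a rule of the form "predict stroke if age > age_t and glucose > glucose_t". This is not the HeuristicLab configuration used in the project. The rule form, the toy data, and all parameter values are our own assumptions for illustration.

```python
import random

# Hypothetical toy data as (age, glucose level, stroke); NOT from the real dataset.
DATA = [
    (25, 85, 0), (34, 140, 0), (41, 95, 0), (48, 110, 0), (55, 100, 0),
    (58, 160, 1), (63, 90, 0), (67, 150, 1), (72, 170, 1), (80, 130, 1),
]

def predict(rule, age, glucose):
    age_t, glucose_t = rule
    return 1 if age > age_t and glucose > glucose_t else 0

def fitness(rule):
    """Fraction of toy rows that the rule classifies correctly."""
    return sum(predict(rule, a, g) == y for a, g, y in DATA) / len(DATA)

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(20, 90), rng.uniform(60, 200)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]  # selection; the best rule always survives
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            child = (mother[0], father[1])  # crossover: one threshold from each parent
            child = (child[0] + rng.gauss(0, 3), child[1] + rng.gauss(0, 5))  # mutation
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

best_rule, history = evolve()
print(f"best rule: age > {best_rule[0]:.1f} and glucose > {best_rule[1]:.1f}")
print(f"accuracy on toy data: {history[-1]:.2f}")
```

Because the best rule is always carried over to the next generation, the best fitness never decreases from one generation to the next. A rule this small is still readable, but evolved models usually grow far more complex, which is where the explainability concern comes from.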
Recommendations
Using a decision tree:
Patients are the primary beneficiaries of this analysis. Early detection can help them take preventive measures, such as making lifestyle changes or taking appropriate medications, to reduce the risk of stroke. So, when the analysis is aimed at patients, explainability matters more than accuracy. A decision tree is the right model in this context.
Using a neural network:
At the other end of the spectrum are doctors and hospitals. Doctors still need a certain measure of explainability, but they can use these results to start preventive care for their patients. Hospitals can use this data to understand where to allocate resources, educate patients about potential risk factors, and invest in medical technology that helps diagnose the condition early. In such cases, we can sacrifice some explainability in favor of higher accuracy. A neural network is the right model in this case.